The higher-order structure of a protein determines its function.
1,937 human proteins have an unknown role (the dark proteome) (Paik et al. 2018).
Goal: develop methods for predicting protein properties from primary structure in a way that is understandable to biologists and experimentally validated.
n-grams (k-tuples, k-mers):
Peptide I: FKVWPDHGSG
Peptide II: YMCIYRAQTN
n-gram examples from peptide I and II:
Longer n-grams are more informative, but create larger attribute spaces that are more difficult to analyze.
Counting n-grams creates sparse matrices, which cause dimensionality problems.
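The trade-off above can be made concrete with a minimal Python sketch. It assumes contiguous n-grams over the 20 standard amino acids only; the helper name `count_ngrams` is illustrative and is not the biogram API. Only observed n-grams are stored, which is the idea behind a sparse representation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def count_ngrams(sequence, n):
    """Count contiguous n-grams, keeping only those that actually occur."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

peptides = {"I": "FKVWPDHGSG", "II": "YMCIYRAQTN"}

for n in (1, 2, 3):
    n_possible = len(AMINO_ACIDS) ** n  # size of the attribute space: 20^n
    for name, seq in peptides.items():
        observed = count_ngrams(seq, n)
        print(f"n={n}, peptide {name}: {len(observed)} of {n_possible} possible n-grams observed")
```

For a 10-residue peptide at most 8 of the 8,000 possible 3-grams can be non-zero, so a dense count matrix would consist almost entirely of zeros.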
| Number of sparse matrices | Package | File size [Mb] |
|---|---|---|
| 1 | base | 0.000214 |
| 1 | slam | 0.001122 |
| 10 | base | 0.000969 |
| 10 | slam | 0.001312 |
| 100 | base | 0.0765 |
| 100 | slam | 0.002625 |
| 1000 | base | 7.629601 |
| 1000 | slam | 0.016357 |
| 10000 | base | 762.939659 |
| 10000 | slam | 0.153687 |
Reduced alphabets:
The following peptides appear to be completely different in terms of amino acid composition.
Peptide I:
FKVWPDHGSG
Peptide II:
YMCIYRAQTN
| Group | Amino acids |
|---|---|
| 1 | C, I, L, K, M, F, P, W, Y, V |
| 2 | A, D, E, G, H, N, Q, R, S, T |
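Applying the two-group alphabet from the table is a simple residue-by-residue lookup. The sketch below is a minimal Python illustration; the names `GROUPS` and `reduce_sequence` are made up for this example.

```python
# Group membership copied from the table above.
GROUPS = {
    "1": "CILKMFPWYV",
    "2": "ADEGHNQRST",
}

# Invert the table: residue -> group label.
RESIDUE_TO_GROUP = {aa: group for group, members in GROUPS.items() for aa in members}

def reduce_sequence(sequence):
    """Replace every residue with the label of its group."""
    return "".join(RESIDUE_TO_GROUP[aa] for aa in sequence)

print(reduce_sequence("FKVWPDHGSG"))  # 1111122222
print(reduce_sequence("YMCIYRAQTN"))  # 1111122222
```

Two peptides that share no residue at any position become identical once encoded in the reduced alphabet.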
Peptide I: FKVWPDHGSG → 1111122222
Peptide II: YMCIYRAQTN → 1111122222
Amyloid aggregates are found in tissues of people suffering from neurodegenerative disorders such as Alzheimer’s disease, Parkinson’s disease and many other diseases.
Amyloid aggregates (red) around neurons (green). Strittmatter Laboratory, Yale University.
Source: National Institute on Aging (NIA) | National Institutes of Health (NIH)
Peptide sequences with amyloidogenic properties (hot spots) are responsible for the aggregation of amyloidogenic proteins:
(Sawaya et al. 2007)
Burdukiewicz, M., Sobczyk, P., Rödiger, S., Duda-Madej, A., Mackiewicz, P., and Kotulska, M. (2017). Amyloidogenic motifs revealed by n-gram analysis. Scientific Reports 7, 12961
The Quick Permutation Test (QuiPT) is a fast alternative to permutation tests for n-gram data. It also allows precise estimation of p-values.
QuiPT is available as part of the biogram R package.
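The slides do not describe how QuiPT works internally. The Python sketch below only illustrates why permutations can be avoided when both the feature (n-gram present or absent) and the label (amyloid or not) are binary: with fixed margins the contingency table is determined by a single count, so the null distribution of a test statistic can be enumerated exactly. Information gain is assumed as the statistic, and all function names are hypothetical. This is not the biogram implementation.

```python
from math import comb, log2

def entropy(n_pos, n_total):
    """Shannon entropy (bits) of a binary label with n_pos positives out of n_total."""
    if n_total == 0 or n_pos in (0, n_total):
        return 0.0
    p = n_pos / n_total
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(n_both, n_with, n_pos, n_total):
    """Information gain of a binary feature about a binary label.

    n_both: positive sequences containing the n-gram
    n_with: sequences containing the n-gram
    n_pos:  positive sequences
    """
    n_without = n_total - n_with
    conditional = ((n_with / n_total) * entropy(n_both, n_with)
                   + (n_without / n_total) * entropy(n_pos - n_both, n_without))
    return entropy(n_pos, n_total) - conditional

def exact_p_value(n_both, n_with, n_pos, n_total):
    """Probability, under random label assignment, of a gain at least as large as observed."""
    observed = information_gain(n_both, n_with, n_pos, n_total)
    lowest = max(0, n_with + n_pos - n_total)
    highest = min(n_with, n_pos)
    n_tables = comb(n_total, n_with)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        # Hypergeometric probability of seeing k positives among the n_with sequences.
        probability = comb(n_pos, k) * comb(n_total - n_pos, n_with - k) / n_tables
        if information_gain(k, n_with, n_pos, n_total) >= observed - 1e-12:
            p_value += probability
    return p_value

# Toy case: the n-gram occurs in all 5 positives and in none of the 5 negatives.
print(exact_p_value(n_both=5, n_with=5, n_pos=5, n_total=10))
```

In the toy case only the two extreme tables reach the observed gain, so the p-value is 2/252. A permutation test would need many thousands of shuffles to resolve a value that small, which is one reason an exact calculation is attractive.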
| Package | Runtime [h], mtry = 5,000 | Runtime [h], mtry = 15,000 | Runtime [h], mtry = 135,000 | Memory usage [GB] |
|---|---|---|---|---|
| randomForest | 101.24 | 116.15 | 248.60 | 39.05 |
| randomForest (MC) | 32.10 | 53.84 | 110.85 | 105.77 |
| bigrf | NA | NA | NA | NA |
| randomForestSRC | 1.27 | 3.16 | 14.55 | 46.82 |
| Random Jungle | 1.51 | 3.60 | 12.83 | 0.40 |
| Rborist | NA | NA | NA | >128 |
| ranger | 0.56 | 1.05 | 4.58 | 11.26 |
| ranger (save.memory) | 0.93 | 2.39 | 11.15 | 0.24 |
| ranger (GWAS mode) | 0.23 | 0.51 | 2.32 | 0.23 |
Runtime and memory usage for the analysis of a simulated dataset mimicking a genome-wide association study (GWAS). NA values indicate unsuccessful analyses: Rborist and bigrf without disk caching failed because of memory shortage for all mtry values and numbers of CPU cores. With disk caching, we stopped bigrf after 16 days of computation.
Marvin N. Wright and Andreas Ziegler. (2017). ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software 77 (1), 1–17
Do standard reduced alphabets developed for other biological problems help to improve amyloid prediction?
Standard reduced amino acid alphabets do not improve the quality of amyloid prediction.
17 measures handpicked from the AAIndex database:
- size of residues,
- hydrophobicity,
- solvent surface area,
- frequency in \(\beta\)-sheets,
- contactivity.
524,284 reduced amino acid alphabets with different levels of alphabet reduction (three to six amino acid groups).
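The slides do not say how the total was obtained. One reading that reproduces the figure exactly is sketched below in Python. It is an assumption, not something stated here: each alphabet is taken to come from one non-empty subset of the 17 measures combined with one of the four group counts.

```python
n_measures = 17
group_counts = range(3, 7)  # three, four, five or six amino acid groups

# Assumption: any non-empty combination of measures may be used.
n_measure_subsets = 2 ** n_measures - 1

n_alphabets = n_measure_subsets * len(group_counts)
print(n_alphabets)  # 524284
```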
Hinges of the boxes correspond to the 0.25 and 0.75 quartiles. The bar inside each box represents the median. The gray circles correspond to reduced alphabets with an AUC outside the 0.95 confidence interval.
For each category, the alphabets were ranked (rank 1 for the best AUC, etc.).
The best alphabet was the one with the lowest rank sum.
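The rank-sum selection can be sketched in a few lines of Python. The category names, alphabet names and AUC values below are invented purely for illustration, and ties are broken by input order, which is a simplification.

```python
# Hypothetical AUC values: category -> {alphabet -> AUC}.
auc = {
    "category_a": {"alph_1": 0.80, "alph_2": 0.85, "alph_3": 0.78},
    "category_b": {"alph_1": 0.82, "alph_2": 0.81, "alph_3": 0.79},
    "category_c": {"alph_1": 0.76, "alph_2": 0.84, "alph_3": 0.83},
}

def ranks_best_first(auc_by_alphabet):
    """Rank alphabets within one category: rank 1 for the highest AUC."""
    ordered = sorted(auc_by_alphabet, key=auc_by_alphabet.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

# Sum each alphabet's rank over all categories.
rank_sums = {}
for per_category in auc.values():
    for name, rank in ranks_best_first(per_category).items():
        rank_sums[name] = rank_sums.get(name, 0) + rank

# The best alphabet has the lowest rank sum.
best = min(rank_sums, key=rank_sums.get)
print(rank_sums, best)
```

Ranking within each category before summing keeps a category with generally higher AUC values from dominating the choice.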
Best-performing reduced alphabet:
Groups 3 and 4 - hydrophobic amino acids.
Group 2 - amino acids disrupting the \(\beta\)-structure (\(\beta\)-breakers).
Is the best-performing reduced amino acid alphabet associated with amyloidogenicity?
The similarity index (Stephenson and Freeland 2013) measures the similarity between two reduced alphabets (1: identical alphabets, 0: completely dissimilar alphabets).
The correlation between the similarity index and the average AUC is significant (\(\textrm{p-value} \leq 2.2 \times 10^{-16}\); \(\rho = 0.51\)).
Are informative n-grams found by QuiPT associated with amyloidogenicity?
Of the 65 most informative n-grams, 15 (23%) are also present in experimentally validated amino acid motifs (Paz and Serrano 2004).
The classifier trained using the best reduced alphabet, AmyloGram, was compared with other amyloid prediction tools on an external dataset.
MCC (Matthews correlation coefficient) measures the performance of a classifier (1: the classifier always recognizes amyloid proteins correctly, -1: the classifier never recognizes amyloid proteins correctly).
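For reference, MCC is computed from the four cells of the confusion matrix. The following is a minimal Python sketch of the standard formula. Returning 0 when the denominator is zero is a common convention and is assumed here.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(tp=10, tn=10, fp=0, fn=0))  # 1.0, every prediction correct
print(mcc(tp=0, tn=0, fp=10, fn=10))  # -1.0, every prediction wrong
print(mcc(tp=5, tn=5, fp=5, fn=5))    # 0.0, no better than chance
```

Unlike accuracy, MCC uses all four cells, so it remains informative when amyloid and non-amyloid sequences are unevenly represented.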
A new functional amyloid produced by Methanospirillum sp. (Christensen et al. 2018) was selected for analysis by AmyloGram.
Summary
Web servers:
R packages:
Funding:
Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2016. “Prediction of Amyloidogenicity Based on the N-Gram Analysis.” e2390v1. PeerJ Preprints. https://peerj.com/preprints/2390.
Burdukiewicz, Michał, Piotr Sobczyk, Stefan Rödiger, Anna Duda-Madej, Paweł Mackiewicz, and Małgorzata Kotulska. 2017. “Amyloidogenic Motifs Revealed by N-Gram Analysis.” Scientific Reports 7 (1): 12961. doi:10.1038/s41598-017-13210-9.
Christensen, Line Friis Bakmann, Lonnie Maria Hansen, Kai Finster, Gunna Christiansen, Per Halkjær Nielsen, Daniel Erik Otzen, and Morten Simonsen Dueholm. 2018. “The Sheaths of Methanospirillum Are Made of a New Type of Amyloid Protein.” Frontiers in Microbiology 9: 2729. doi:10.3389/fmicb.2018.02729.
Murphy, Lynne Reed, Anders Wallqvist, and Ronald M. Levy. 2000. “Simplified Amino Acid Alphabets for Protein Fold Recognition and Implications for Folding.” Protein Engineering 13 (3): 149–52. doi:10.1093/protein/13.3.149.
Paz, Manuela López de la, and Luis Serrano. 2004. “Sequence Determinants of Amyloid Fibril Formation.” Proceedings of the National Academy of Sciences 101 (1): 87–92. doi:10.1073/pnas.2634884100.
Sawaya, Michael R., Shilpa Sambashivan, Rebecca Nelson, Magdalena I. Ivanova, Stuart A. Sievers, Marcin I. Apostol, Michael J. Thompson, et al. 2007. “Atomic Structures of Amyloid Cross-β Spines Reveal Varied Steric Zippers.” Nature 447 (7143): 453–57. doi:10.1038/nature05695.
Stephenson, James D., and Stephen J. Freeland. 2013. “Unearthing the Root of Amino Acid Similarity.” Journal of Molecular Evolution 77 (4): 159–69. doi:10.1007/s00239-013-9565-0.